Abstract. - We introduce a framework for network analysis based on random walks on directed
acyclic graphs where the probability of passing through a given node is the key ingredient. We 
illustrate its use in evaluating the mutual influence of nodes and discovering seminal papers in a 
citation network. We further introduce a new similarity metric and test it in a simple personalized 
recommendation process. This metric's performance is comparable to that of classical similarity 
metrics, thus further supporting the validity of our framework. 



The past two decades have witnessed a network rev- 
olution [1] fueled by the ever-increasing computer com- 
putational power at our disposal and by the availability 
of rich datasets mapping virtually all fields of human activity.
Complex networks and algorithms based on
these resources found their application in the most diverse 
fields, ranging from nonlinear dynamics and critical phenomena
[3-5] to social and economic systems [6]. Random
walks are among the most prominent classes of processes 
taking place on networks, being employed in importance 
rankings for the World Wide Web [7], recommender systems [8],
disease transmission models [9], node similarity [10] and
many other areas [11].

A relatively less-studied class of networks is represented 
by directed acyclic graphs (DAGs) which occur in both 
natural and artificial systems. Their acyclicity (absence of 
directed cycles) stems either from an implicit time order- 
ing (as in citation networks where only past papers can 
be cited) or from natural constraints (as in food webs). 
Even when nodes of a DAG do not have time stamps at- 
tached, a causal structure with all edges pointing from 
later to earlier nodes can always be recovered. Theoret- 
ical models exist for building random DAGs with fixed 
degree sequences or with fixed expected degrees [12,13].

Acyclicity turns out to be highly advantageous for filtering
information through a random walk process. If we consider a
random walk on a generic network, the probability of passing
through a given node, which we refer to as the passage
probability, is usually not a meaningful quantity as it may
well be equal to one for all nodes in the network. The
situation is quite different for a DAG, as every random walk
along the network's edges comes to an end when a root node
with zero out-degree is reached.

In this Letter we introduce an analytical framework for DAGs
to quantify the influence of one node over another based on
the passage probability and discuss its applications. In
particular, we propose a method to identify papers fundamental
to the growth of a given research area and define a new
similarity metric. The relation to PageRank, which has been
applied to citation data before [14] (see [15] for a historical
perspective of PageRank and other fields of its applicability),
is also discussed. We test our framework on citation data
provided by the American Physical Society and we show that:
(i) the proposed method is able to uncover seminal papers even
if they do not have particularly high citation counts, (ii) the
similarity metric performs well when used as a component of a
simple recommendation algorithm [16]. Note that the time
dimension, neglected by many information filtering techniques,
is implicitly taken into account by acting on a DAG. While we
use academic citation data to test our model and often refer to
papers and citations instead of nodes and edges, the majority
of this work is general and applicable to other DAGs such as
those representing family trees and reference networks of
patents [17] and legal cases [18].

Consider a directed acyclic graph composed of $N$ nodes
and $L$ directed edges pointing from newer to older nodes.

Fig. 1: Comparison of a random walk starting at X (a) with the
passing of "genes" (b). According to the description in the
main text, $c_i = e_i$ for $i = 1, 2, 3$,
$c_4 = \frac{1}{3}(c_1 + c_2 + c_3) + e_4$, $c_5 = c_3 + e_5$, and
$c_6 = \frac{1}{3}(c_1 + c_4 + c_5) + e_6 = \frac{4}{9} e_1 + \frac{1}{9} e_2 + \frac{4}{9} e_3 + \frac{1}{3} e_4 + \frac{1}{3} e_5 + e_6$.
Coefficients in $c_6$ agree with the corresponding passage
probabilities in (a). Note that while the random walk proceeds
from top to bottom, genetic composition propagates from bottom
to top.

The in- and out-degree of node $x$ are denoted as
$k_x^{\mathrm{in}}$ and $k_x^{\mathrm{out}}$, respectively. We
further denote by $\mathcal{A}_x$ the set of nodes that can be
reached from node $x$ ($x$'s ancestors) and by $\mathcal{P}_x$
the set of nodes from which $x$ can be reached ($x$'s progeny).
Since the network is acyclic,
$\forall x: \mathcal{A}_x \cap \mathcal{P}_x = \emptyset$.
A random walk starting in node $x$ can be encoded in an
$N$-dimensional vector $G_x$ whose $i$th component represents
the probability of passing through node $i$ (see Fig. 1 for an
illustration). Thanks to the network's acyclicity, $G_x$
fulfills the equation

$$G_x = W G_x, \tag{1}$$

where $W$ is the transition matrix with elements
$W_{ij} = 1/k_j^{\mathrm{out}}$ if $j$ cites $i$ and $W_{ij} = 0$
otherwise. The boundary condition for Eq. (1) is given by
$(G_x)_x = 1$, which reflects that any random walk certainly
passes through its starting point. (One can also obtain $G_x$
by simply following the random walk starting at node $x$, as is
done in Fig. 1.) Elements of $G_x$ are by definition positive
for all nodes in $\mathcal{A}_x$ and zero for all other nodes
except $x$ itself. Nodes without outgoing links are represented
by a zero column in $W$ and act as sinks for the random walk.

To obtain a compact formalism, we construct an $N \times N$
matrix $G$ where column $x$ is equal to $G_x$. Elements of this
matrix have a simple interpretation: $G_{yx}$ represents the
probability of passing through node $y$ when starting in node
$x$. One may check that $G_{yx} = \sum_{n=0}^{\infty} (W^n)_{yx}$
(since $W$ is a transition matrix, $(W^n)_{yx}$ is the
probability of moving from $x$ to $y$ over a path of length
$n$). Note that while Eq. (1) resembles an equation for
stationary occupation probabilities, this is not the case:
unlike the classical random walk utilized by PageRank, the
stationary occupation probability here is zero for all nodes
due to the presence of sinks (the relation between our
framework and PageRank is discussed in detail below). This
concept can be readily generalized to a weighted DAG by
assuming that the probability of choosing an outgoing edge is
proportional to the edge's weight.
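
As a minimal sketch of how the passage probabilities can be obtained by following the walk, the Python snippet below propagates probability through a small citation dictionary in topological order. The function name `passage_probabilities`, the dictionary layout and the toy network (in which paper 6 is assumed to cite papers 1, 4 and 5) are illustrative assumptions, not part of the original analysis.

```python
from fractions import Fraction


def passage_probabilities(cites, start):
    """Column G_start: probability that a walk from `start` passes through each node.

    `cites[x]` lists the nodes cited by x (hypothetical input format).
    """
    # Depth-first search yields the reachable nodes in reverse topological order.
    order, seen = [], set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for cited in cites[node]:
            visit(cited)
        order.append(node)

    visit(start)
    order.reverse()  # now every node precedes all nodes it cites

    prob = {node: Fraction(0) for node in order}
    prob[start] = Fraction(1)  # boundary condition (G_x)_x = 1
    for node in order:
        for cited in cites[node]:
            # each outgoing edge is followed with probability 1 / k_out
            prob[cited] += prob[node] / len(cites[node])
    return prob


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
G6 = passage_probabilities(toy, 6)
```

Nodes that cannot be reached from the starting node are absent from the returned dictionary, which corresponds to zero entries of $G_x$. Exact fractions are used only to keep the toy values readable; floating-point numbers would do for real data.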

It is instructive to complement the above random walk approach
with an analogy based on genes spreading in a population. In
the context of citation data, consider vectors of "genetic"
composition of papers and assume that each paper's vector is
obtained by averaging the vectors of the cited papers
(inherited knowledge) and by adding the paper's own
contribution (new knowledge). A similar model based on the
genetic composition of scientific papers has been shown to
reproduce many quantitative features of science [19]. Fig. 1
illustrates this process on a toy network. For example,
$c_6 = \frac{1}{3}(c_1 + c_4 + c_5) + e_6$, where $e_6$
represents the contribution of paper 6 which is, by definition,
orthogonal to the contribution vectors of all previous papers.
Vectors $e_1, e_2, \dots$ therefore constitute a basis of a
space of growing dimension. The accumulation of knowledge is
reflected in the lack of normalization of the composition
vectors $c_x$, which are of greater magnitude for recent papers
than for old ones. From the correspondence between all possible
paths from $x$ to $y$ and the possible ways in which the
composition $c_y$ can propagate to $x$, it is straightforward
to show that when the composition of a paper is written in
terms of the basis vectors, the coefficients of the respective
basis vectors are equal to the passage probabilities obtained
by the random walk approach and hence $c_x = G_x$ (see Fig. 1).
We can say that the previously introduced passage probabilities
$G_x$ represent the influence of past papers on paper $x$ and,
at the same time, the "genetic" composition of paper $x$.
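
The averaging rule can be sketched directly as a recursion. The code below is an illustrative assumption about data layout (a dictionary `cites` and a function named `composition`), with the toy network again assuming that paper 6 cites papers 1, 4 and 5. It returns the coefficients of the basis vectors $e_i$ in $c_x$.

```python
from fractions import Fraction


def composition(cites, x, cache=None):
    """Coefficients of the basis vectors e_i in the composition vector c_x."""
    cache = {} if cache is None else cache
    if x in cache:
        return cache[x]
    coeffs = {x: Fraction(1)}  # the paper's own contribution e_x
    for cited in cites[x]:
        # inherited knowledge: average over the cited papers' vectors
        for basis, value in composition(cites, cited, cache).items():
            coeffs[basis] = coeffs.get(basis, Fraction(0)) + value / len(cites[x])
    cache[x] = coeffs
    return coeffs


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
c6 = composition(toy, 6)
```

For this toy network the coefficients of $c_6$ coincide with the passage probabilities of a walk started at paper 6, which is the equality $c_x = G_x$ in a concrete case.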

Given our understanding that $G_{xy}$ quantifies the influence
of $x$ on $y$, we may introduce the total aggregate impact of
node $x$

$$I_x = \sum_y G_{xy}, \tag{2}$$

where the number of non-zero terms in the summation is
determined by $P_x := |\mathcal{P}_x|$ (which we refer to as
the progeny size of node $x$). The value $I_x$ is not
meaningful by itself because it is naturally biased by the size
of $\mathcal{P}_x$. This makes it sensitive to the time of the
paper's appearance (old nodes tend to have greater progenies)
and to the amount of literature in the paper's research field.
It is therefore more informative to plot $I_x$ vs $P_x$. A
large value of $I_x / P_x$ is achieved when the influence of
$x$ is effectively channeled to the papers in $\mathcal{P}_x$,
for example when even papers that do not cite $x$ directly
refer mostly to papers citing $x$. Therefore we expect outliers
in the plane $(P_x, I_x)$ to be seminal papers which founded
new branches of research.
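
A hedged sketch of Eq. (2) follows: it builds the influence matrix for a toy network and sums its rows. The names `influence_matrix` and `impact_and_progeny` are hypothetical, the toy network assumes that paper 6 cites papers 1, 4 and 5, and the progeny size is counted here without the node itself (the diagonal term $G_{xx} = 1$ is excluded from the count), which is an assumption about the counting convention.

```python
from fractions import Fraction


def influence_matrix(cites):
    """Nested dicts with G[y][x] = probability of passing through y when starting in x."""
    memo = {}

    def column(x):
        # G_x equals the unit vector of x plus the average of the cited papers' columns
        if x not in memo:
            col = {x: Fraction(1)}
            for cited in cites[x]:
                for y, p in column(cited).items():
                    col[y] = col.get(y, Fraction(0)) + p / len(cites[x])
            memo[x] = col
        return memo[x]

    G = {y: {} for y in cites}
    for x in cites:
        for y, p in column(x).items():
            G[y][x] = p
    return G


def impact_and_progeny(cites):
    G = influence_matrix(cites)
    # Eq. (2): sum row x of G over all starting nodes y
    impact = {x: sum(G[x].values(), Fraction(0)) for x in cites}
    # progeny size: starting nodes other than x that can reach x
    progeny = {x: len(G[x]) - 1 for x in cites}
    return impact, progeny


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
impact, progeny = impact_and_progeny(toy)
```

Plotting `impact[x]` against `progeny[x]` for every node gives the kind of scatter plot discussed above; on real data one would look for points lying well above the bulk.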

Influence, originality and similarity in directed acyclic graphs

Fig. 2: Total influence of papers $I_x$ versus their progeny
size $P_x$ for the APS citation data (for clarity, only 413
papers with $I_x > 20 + P_x/400$ are shown). Details about the
marked outliers are given in Tab. 1.

It is illustrative to discuss the relation between the
aggregate impact $I_x$ and the Google PageRank score. To do
that, we combine Eqs. (1) and (2) to write $I_x$ as a solution
of the self-consistent equation

$$I_x = 1 + \sum_y W_{xy} I_y, \tag{3}$$

where $I_x = 1$ for all nodes without progeny (i.e.,
$k_x^{\mathrm{in}} = 0$). The structure of this equation
resembles that of the classical PageRank equation. The
similarity can be enhanced further if instead of the "gene"
composition spreading discussed above, we consider its
normalized version. This normalized spreading is achieved by
assuming that each paper's genetic vector is composed of a
fraction $(1 - \alpha)$ of its original contribution plus a
fraction $\alpha$ of the average over its parents' genetic
vectors (thus the vector's norm is fixed to one for all papers
with at least one ancestor). Hence we obtain a new matrix of
genetic composition, $G^\alpha$, which in turn can be used to
compute a new aggregate impact $I_x^\alpha$. The
self-consistent equation for $I_x^\alpha$ now has the form

$$I_x^\alpha = 1 - \alpha + \alpha \sum_y W_{xy} I_y^\alpha, \tag{4}$$

where $I_x^\alpha = 1 - \alpha$ for all nodes without progeny.
Up to replacing $1 - \alpha$ with $(1 - \alpha)/N$ (which only
affects the overall scale of $I_x^\alpha$), this equation is
identical to the equation of PageRank: $\alpha$ and
$1 - \alpha$ are the probabilities that the random walk follows
an existing link and jumps, respectively, and $I_x^\alpha$ is
the PageRank value of node $x$. Since the term $1 - \alpha$
only sets the scale of $I_x^\alpha$ and in the limit
$\alpha \to 1$ the propagation term
$\alpha \sum_y W_{xy} I_y^\alpha$ in Eq. (4) is equal to that in
Eq. (3), we see that rankings of nodes according to the
aggregate impact $I_x$ and the limit PageRank value
$\lim_{\alpha \to 1} I_x^\alpha$ are equivalent.
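
The two self-consistent equations can be solved by the same recursion over citing papers, which the sketch below illustrates. The function `self_consistent_impact`, its `offset` argument and the toy network (paper 6 assumed to cite papers 1, 4 and 5) are hypothetical names and data chosen for illustration. The update used is $I_x = \text{offset} + \alpha \sum_{y \text{ citing } x} I_y / k_y^{\mathrm{out}}$.

```python
from fractions import Fraction


def self_consistent_impact(cites, alpha=Fraction(1), offset=Fraction(1)):
    """Solve I_x = offset + alpha * sum over citing papers y of I_y / k_y^out on a DAG."""
    citers = {x: [] for x in cites}
    for y, refs in cites.items():
        for x in refs:
            citers[x].append(y)

    memo = {}

    def impact(x):
        if x not in memo:
            inflow = sum((impact(y) / len(cites[y]) for y in citers[x]), Fraction(0))
            memo[x] = offset + alpha * inflow
        return memo[x]

    return {x: impact(x) for x in cites}


def ranking(scores):
    # descending score, ties broken by node label
    return sorted(scores, key=lambda x: (-scores[x], x))


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
I = self_consistent_impact(toy)                          # undamped form, Eq. (3)
alpha = Fraction(999, 1000)
I_alpha = self_consistent_impact(toy, alpha, 1 - alpha)  # damped form, Eq. (4)
```

For $\alpha$ close to one, `ranking(I_alpha)` reproduces `ranking(I)` on this toy network, in line with the limit argument above; the recursion terminates because citing papers are always newer than the papers they cite.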

Both $I_x$ and $I_x^\alpha$ are naturally biased by the progeny
size of node $x$. In the case of $I_x^\alpha$, this bias can be
partially removed by setting $\alpha < 1$, which leads to
impact spreading mainly over a local neighborhood. In the case
of $I_x$, we remove the bias by placing the nodes in the plane
$(I_x, P_x)$, which allows us to distinguish exceptional nodes
better than the one-dimensional PageRank value with one
parameter ($\alpha$). While PageRank certainly has its merit
for the WWW, in what follows we attempt to show that influence
and impact propagating without damping are useful for DAGs.

We now illustrate our ideas on the citation data provided by
the American Physical Society (APS). This data contains all
449 705 papers published by the APS from 1893 to 2009 together
with their citations to the APS journals. To make the data
strictly acyclic, we do not consider a small number of
citations that are between papers of the same print date; we
are then left with 4 672 812 citations. Fig. 2 shows all papers
published by the APS after 1940 and reveals an expected linear
relationship between $I_x$ and $P_x$ with several outstanding
papers whose influence is much greater than that of other
papers of the same progeny size. (Papers published before 1940
are omitted because of the data sparseness which is amplified
by the limitation of our data to citations to and from the APS
journals.) Table 1 lists the outliers together with scientific
prizes as a proxy for their quality. While our results are
affected by using only the APS citations^1, one can conclude
that the majority of these outlying papers really represent
exceptional research. While it is not our goal to rank the
papers, one could achieve that for example by dividing $I_x$ by
the average $I_x$ of papers with the same progeny size $P_x$,
thus making papers of different age comparable.

Outliers in the $(P_x, I_x)$ plane often do not have
particularly high citation counts. When we apply the classical
PageRank algorithm to our data as in [2], we observe that many
of them do not receive high PageRank values. The differences
stem, of course, from differences between the algorithms. While
PageRank is a reputation metric [20] rewarding papers cited by
other reputable papers, our approach focuses on the progeny
created by each individual paper. In consequence, even a paper
which is not directly cited by popular papers can score high if
it establishes a new research direction or a school of thought.
In this sense, our approach evaluates the originality of
papers. On the other hand, interdisciplinary works necessarily
focus the flow of influence less and hence they are not likely
to score high with respect to the $I_x/P_x$ criterion.

We finally note that the definition of the PageRank score
$I_x^\alpha$ in Eq. (4) allows for a meaningful search for
outliers in the $(I_x^\alpha, k_x^{\mathrm{in}})$ plane (see
[14]), similarly to what we do in the $(I_x, P_x)$ plane for
the aggregate impact $I_x$. While some papers appear as
outliers in both planes, there are some significant differences
which further demonstrate the distinction between our
evaluation metric and PageRank (see Fig. 3). These differences,
marked with bold letters in Table 1, correspond to relatively
recent but seminal papers, suggesting that our method is more
effective in removing the inherent time bias of citation data
discussed above.

^1 For example, paper P, which is not (to the best of our
knowledge) particularly outstanding, owes its high total impact
to the fact that it is the only paper in the APS data cited by
the high-impact paper Q. Since paper Q in reality cites many
more papers, paper P probably would not excel if complete
citation data were used for the analysis (this has already been
discussed in [14]). Similar problems arise for those research
fields where the original work was not published in APS
journals (take high-temperature superconductivity, for
example).

| Id | Title | Authors | Year | Prize | PR | CR |
|----|-------|---------|------|-------|----|----|
| A | Statistics of the Two-Dimensional Ferromagnet... | H. A. Kramers, G.H. Wannier | 1941 | LM | 54 | 1 645 |
| B | Crystal Statistics in a Two-Dimensional Model... | L. Onsager | 1944 | NP | 8 | 87 |
| C | Theory of Superconductivity | J. Bardeen, et al. | 1957 | NP | 2 | 10 |
| D | The Maser - New Type of Microwave Amplifier,... | J. Gordon et al. | 1955 | NP | 369 | 14 517 |
| E | Infrared and Optical Masers | A. Schawlow, C. Townes | 1958 | NP | 171 | 2 108 |
| F | Population Inversion and Continuous Optical Maser | A. Javan et al. | 1961 | + | 169 | 14 517 |
| G | Dynamical Model of Elementary Particles Based on... | Y. Nambu, G. Jona-Lasinio | 1961 | NP | 24 | 50 |
| H | Self-Consistent Equations Including Exchange and... | W. Kohn, L. Sham | 1965 | NP | 1 | 1 |
| I | Inhomogeneous Electron Gas | P. Hohenberg, W. Kohn | 1964 | MPM | 3 | 2 |
| J | A Model of Leptons | S. Weinberg | 1967 | NP | 6 | 18 |
| K | Static Phenomena Near Critical Points:... | L. Kadanoff, et al. | 1967 | MPM | 58 | 355 |
| L | Radiative Corrections as the Origin of Spontaneous... | S. Coleman, E. Weinberg | 1973 | DM | 31 | 75 |
| M | Scaling Theory of Localization:... | E. Abrahams, et al. | 1979 | NP | 11 | 24 |
| N | New Measurement of the Proton Gyromagnetic Ratio... | E.R. Williams, P.T. Olsen | 1979 | | 150 | 26 327 |
| O | New Method for High-Accuracy Determination of... | K. Klitzing | 1980 | NP | 32 | 134 |
| P | Cluster Formation in Two-Dimensional Random Walk | H. Rosenstock, C. Marquardt | 1980 | | 109 | 217 150 |
| Q | Diffusion-Limited Aggregation... | T.A. Witten, L.M. Sander | 1981 | | 17 | 64 |
| R | Electronic Structure of BaPb$_{1-x}$Bi$_x$O$_3$ | L.F. Mattheiss, D.R. Hamann | 1983 | | 106 | 4 224 |
| S | Bulk Superconductivity at 36 K in La$_{1.8}$Sr$_{0.2}$CuO$_4$ | R.J. Cava et al. | 1987 | | 37 | 1 086 |
| T | Evidence for Superconductivity above 40 K in... | C.W. Chu et al. | 1987 | | 40 | 606 |
| U | Superconductivity at 93 K in a New Mixed-Phase... | M.K. Wu et al. | 1987 | | 19 | 102 |
| V | Self-Organized Criticality: An Explanation of... | P. Bak et al. | 1987 | + | 16 | 47 |
| a | Teleporting an Unknown Quantum State via... | C.H. Bennett et al. | 1993 | + | 53 | 26 |
| b | Bose-Einstein Condensation in a Gas of Sodium Atoms | K.B. Davis et al. | 1995 | NP | 63 | 27 |
| c | Evidence of Bose-Einstein Condensation in... | C.C. Bradley et al. | 1995 | + | 99 | 51 |
| d | TeV Scale Superstring and Extra Dimensions | G. Shiu, S.-H.H. Tye | 1998 | | 216 | 3 991 |
| e | Small-World Networks: Evidence for a Crossover Picture | M. Barthelemy, L.A.N. Amaral | 1999 | + | 658 | 9 872 |
| f | Negative Refraction Makes a Perfect Lens | J.B. Pendry | 2000 | DM | 279 | 192 |
| g | Composite Medium with Simultaneously Negative... | D.R. Smith et al. | 2000 | + | 433 | 459 |
| h | Statistical Mechanics of Complex Networks | R. Albert, A.-L. Barabasi | 2002 | VNM | 112 | 59 |

Table 1: An approximately time-ordered list of the papers marked in Fig. 2 (labels agree with those marked in the figure).
To evaluate the quality of the list, we indicate the most important prize received by the authors for research pertinent to the
listed papers (LM=Lorentz Medal, NP=Nobel Prize, MPM=Max Planck Medal, DM=Dirac Medal, VNM=John Von Neumann
Medal). Important prizes are rarely awarded soon after a discovery is made and this bias is well visible in our table. To overcome
this, we add an additional distinguishing criterion for prize-free papers: if they are described as pioneering works in a certain
domain on Wikipedia, we mark them with +. The last two columns show the paper's ranking given by the PageRank score
when $\alpha = 0.5$ (PR) and the citation count (CR). Bold labels correspond to the papers not detectable as outliers in Fig. 3.

Fig. 3: PageRank with $\alpha = 0.5$ vs citation count (with an
older version of the APS data, a similar plot was already
presented in [14]). Outliers from Fig. 2 are marked either with
red squares (if they can be considered as outliers also in this
figure) or with blue crosses (if they are not outliers here;
these papers have their label written in bold in Table 1).

After showing that our concept of influence quantified by the
$G$ matrix has its merit, we use it to evaluate the similarity
of papers. The basic idea is that papers $x$ and $y$ are
similar if they are influenced by the same works (they have
similar "genetic" composition). To evaluate this similarity we
take

$$S^*(x,y) = \sum_i \sqrt{G_{ix} G_{iy}}. \tag{5}$$

It is also possible to base the similarity on
$\min\{G_{ix}, G_{iy}\}$ or $G_{ix} G_{iy}$, for example; we
present here the choice performing best in our numerical tests.
Note that this similarity is not normalized: its lower bound is
zero but the upper bound is given only by
$|\mathcal{A}_x \cap \mathcal{A}_y|$. We stress that $S^*$ is
parameter-free and hence practical to use.

The standard way to evaluate a similarity metric is to test how
well it is able to reproduce missing links in a network
[21,22]. In practice this means that a small part of the links
(usually 10%) is removed from the network and one attempts to
guess the removed links by seeing which similar nodes are not
connected. A similarity metric which is able to "repair" the
network well presumably captures the network's structure well,
and one may use it also for purposes other than link
prediction. In the case of our similarity metric $S^*$, we
adopt a slightly different approach: we test how good the
recommendations are that it is able to provide to selected
individuals. This change is motivated by the potential
practical use of such recommendations for scientists who often
face the problem of searching for relevant literature in their
research field.
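
A minimal sketch of the similarity of Eq. (5), read here as $S^*(x,y) = \sum_i \sqrt{G_{ix} G_{iy}}$, is given below. The helper names `influence_columns` and `similarity`, the dictionary-based input and the toy network (paper 6 assumed to cite papers 1, 4 and 5) are assumptions made for illustration. The sum runs over all nodes with non-zero passage probability from both papers, including the papers themselves, which is also an assumption about the convention.

```python
from math import sqrt


def influence_columns(cites):
    """Return {x: G_x}, where G_x maps node i to the passage probability G_ix."""
    memo = {}

    def column(x):
        if x not in memo:
            col = {x: 1.0}
            for cited in cites[x]:
                for i, p in column(cited).items():
                    col[i] = col.get(i, 0.0) + p / len(cites[x])
            memo[x] = col
        return memo[x]

    return {x: column(x) for x in cites}


def similarity(G, x, y):
    """Sum of sqrt(G_ix * G_iy) over nodes i reached from both x and y."""
    shared = G[x].keys() & G[y].keys()
    return sum(sqrt(G[x][i] * G[y][i]) for i in shared)


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
G = influence_columns(toy)
s45 = similarity(G, 4, 5)  # papers 4 and 5 share only paper 3
```

Replacing the square root by `min` or by the plain product gives the alternative choices mentioned in the text.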

Our tests are done as follows. We first divide the data into
two parts: papers published until the year 2003 (the sample
set, which contains approximately 75% of all papers) and those
published after 2003 (the probe set). Then we find the 20
most-cited articles published in each core APS journal in 2003
(we consider seven journals: Phys. Rev. Lett., Rev. Mod. Phys.
and Phys. Rev. A-E) and take their last authors if they
published at least one paper with the APS after 2003.
Recommendations are made for each test author separately on the
basis of papers published by this author in 2003. Denoting the
set of papers published by author $a$ in 2003 as
$\mathcal{U}_a$, the recommendation score of paper $x$ is given
by its similarity with all $y$ in this set

$$r_x = \sum_{y \in \mathcal{U}_a} S^*(x,y). \tag{6}$$

Papers that have not been cited by author $a$ until 2003 are
then sorted according to their score in descending order and
those at the top represent the personalized recommendation for
this author.
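
The scoring and sorting step of Eq. (6) can be sketched as follows. The function `recommend`, its arguments and the small similarity table are hypothetical and serve only to show the mechanics; any similarity function taking two papers can be plugged in.

```python
def recommend(similarity, candidates, author_papers, already_cited, top=10):
    """Rank candidate papers for one author by summed similarity, as in Eq. (6)."""
    scores = {}
    for x in candidates:
        if x in already_cited:
            continue  # only papers not yet cited by the author are recommended
        scores[x] = sum(similarity(x, y) for y in author_papers)
    ranked = sorted(scores, key=lambda x: (-scores[x], str(x)))
    return [(x, scores[x]) for x in ranked[:top]]


# Hypothetical similarity values between candidate papers and an author's papers.
table = {("p1", "a1"): 0.5, ("p2", "a1"): 0.25, ("p2", "a2"): 0.75, ("p3", "a1"): 1.0}


def toy_similarity(x, y):
    return table.get((x, y), 0.0)


result = recommend(toy_similarity, ["p1", "p2", "p3"], ["a1", "a2"], already_cited={"p3"})
```

Here `p3` is skipped because the author has already cited it, and the remaining papers are returned with their scores in descending order.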

The resulting recommendations are evaluated using the probe
set, which allows us to label as "relevant" those papers that
were eventually cited by a given author after 2003. To curb the
level of noise in the results, we discard authors with fewer
than 10 relevant papers to be guessed. We are then left with
the final set of 99 test authors who have on average 116
relevant items to be guessed out of almost 340 000 papers
published until 2003. To assess the recommendations, we use
metrics often used in the field of recommender systems [16]:
(i) precision $P_{100}$ (the fraction of the top 100 places of
the recommendation list occupied by the relevant papers), (ii)
recall $R_{100}$ (the fraction of the relevant papers appearing
at the top 100 places of the recommendation list), (iii) the
average ranking of the relevant papers $q_R$ (expressed as a
fraction of all potentially relevant papers), and (iv) the
fraction of the relevant papers with non-zero score $f_R$. A
good recommendation list should have relevant papers at the
top, i.e., high $P_{100}$ and $R_{100}$ and low $q_R$, and it
should assign non-zero scores to most relevant papers, i.e.,
high $f_R$ (all these quantities lie in the range [0, 1]).
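
The four quantities can be computed from a ranked list as in the sketch below. The function `evaluate` and its arguments are hypothetical. As an assumption, $q_R$ is averaged only over relevant papers with non-zero score and normalized by the total number of ranked candidates; the text does not specify how unscored papers enter the average.

```python
def evaluate(ranked, scores, relevant, top=100):
    """Return (precision, recall, q_R, f_R) for one recommendation list.

    ranked   -- candidate papers sorted by descending score
    scores   -- mapping paper -> recommendation score
    relevant -- set of papers the author eventually cited
    """
    hits = sum(1 for paper in ranked[:top] if paper in relevant)
    precision = hits / top
    recall = hits / len(relevant)

    scored = [paper for paper in relevant if scores.get(paper, 0) > 0]
    f_R = len(scored) / len(relevant)

    position = {paper: rank for rank, paper in enumerate(ranked, start=1)}
    if scored:
        # mean rank of scored relevant papers, relative to the list length
        q_R = sum(position[paper] for paper in scored) / (len(scored) * len(ranked))
    else:
        q_R = None  # no relevant paper received a score
    return precision, recall, q_R, f_R


# Hypothetical five-paper example with three relevant papers.
scores = {"a": 3, "b": 2, "c": 1, "d": 0, "e": 0}
ranked = ["a", "b", "c", "d", "e"]
metrics = evaluate(ranked, scores, relevant={"a", "c", "d"}, top=2)
```

In this example one of the top two places is relevant, one of three relevant papers is recovered there, two of the three relevant papers are scored, and their mean relative rank is 0.4.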

To test our similarity, we compare its performance in a
recommendation process with other similarity metrics. Based on
results presented in [22], we have selected three highly
performing metrics: the Common Neighbors similarity (CN), the
Resource Allocation Index (RA), and the Katz-based similarity
(KA). Since they are all defined on undirected networks, we
evaluate them assuming that all links in our data are
undirected. CN simply counts the number of common neighbors for
a pair of nodes. RA does the same but it gives less weight to
common neighbors with many connections,

$$S^{RA}(x,y) = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z}, \tag{7}$$

where $\Gamma(x)$ is the set of direct neighbors of node $x$
and $k_z$ is the degree of node $z$. We finally employ a
commonly used similarity, KA, which counts the number of paths
between two given nodes with individual paths weighted
exponentially less according to their length (this similarity
has a close relation with the Katz centrality measure [23]).
Denoting the network's adjacency matrix with $A$, KA can be
written in the form of a series

$$S^{KA}(x,y) = \sum_{i=1}^{\infty} \beta^i (A^i)_{xy}. \tag{8}$$

In our case, we use $\beta = 0.75$ which yields slightly
superior performance. Local similarities $S^{CN}$ and $S^{RA}$
are computationally considerably less demanding than global
(based on the whole network) similarities $S^*$ and $S^{KA}$.
For practical reasons, we limit the computation of $S^*$ to
papers that are not more than six steps from both $x$ and $y$.
For $S^{KA}$, we limit its summation to a finite order of $A$
(see Fig. 4 for how these restrictions affect the results).

Fig. 4: Precision and recall (a) and average ranking of
relevant items and fraction of ranked relevant items (b) for
$S^{KA}$ (red symbols, dotted lines), $S^*$ (blue symbols,
solid lines) and $S^{RA}$ (dashed lines). $S^{KA}$ shows a
strong dependency on the maximal distance, with the best
$P_{100}$ and $R_{100}$ achieved when the maximal distance is
3. However, $f_R$ is only 0.79 at this point, which means that
at this level of truncation, it represents a transition between
local and global similarity metrics. When all powers of $A$ are
included, $S^{KA}$ performs poorly with respect to all measured
characteristics but $f_R$. By contrast, the performance of
$S^*$ decreases only slightly when the maximal distance is
above eight.
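
For comparison, Eqs. (7) and (8) can be sketched on the undirected version of a toy network. The function names, the truncation argument `max_order` and the toy network (paper 6 assumed to cite papers 1, 4 and 5) are illustrative assumptions. The Katz sum is truncated at a finite order and evaluated by counting walks step by step rather than by forming matrix powers.

```python
from fractions import Fraction


def undirected_neighbors(cites):
    """Neighbor sets obtained by ignoring the direction of citations."""
    neighbors = {x: set() for x in cites}
    for x, refs in cites.items():
        for y in refs:
            neighbors[x].add(y)
            neighbors[y].add(x)
    return neighbors


def resource_allocation(neighbors, x, y):
    """Eq. (7): common neighbors weighted by the inverse of their degree."""
    common = neighbors[x] & neighbors[y]
    return sum((Fraction(1, len(neighbors[z])) for z in common), Fraction(0))


def katz(neighbors, x, y, beta, max_order):
    """Eq. (8) truncated at `max_order`: sum of beta**i * (A**i)_xy."""
    walks = {x: 1}  # number of walks of the current length from x to each node
    weight, total = 1, 0
    for _ in range(max_order):
        extended = {}
        for node, count in walks.items():
            for nxt in neighbors[node]:
                extended[nxt] = extended.get(nxt, 0) + count
        walks = extended
        weight = weight * beta
        total += weight * walks.get(y, 0)
    return total


# Toy network assumed for illustration: each key cites the listed papers.
toy = {1: [], 2: [], 3: [], 4: [1, 2, 3], 5: [3], 6: [1, 4, 5]}
nb = undirected_neighbors(toy)
ra45 = resource_allocation(nb, 4, 5)
ka45 = katz(nb, 4, 5, Fraction(3, 4), max_order=3)
```

Papers 4 and 5 share the neighbors 3 and 6, which have degrees 2 and 3, so the RA value is 1/2 + 1/3. Raising `max_order` adds longer walks to the Katz value, which is the truncation effect studied in Fig. 4.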

The similarities described above can be substituted for
$S^*(x,y)$ in Eq. (6), leading to recommendations which can in
turn be compared with those obtained with $S^*$. Test results
can be found in Fig. 4, where we plot the performance of the
different algorithms vs the maximal distance used to compute
global similarities. Results for the Resource Allocation Index
are indicated with flat lines while results for the Common
Neighbors similarity are omitted because they are always worse
than for RA. In general we see a good performance of $S^{RA}$
with respect to precision and recall. This is because local
metrics rank only a small set of papers (local neighborhoods)
where there is a high probability of finding relevant papers.
The drawback is that only a minor part of the relevant papers
is found ($f_R \approx 0.4$) and their average ranking is poor
($q_R \approx 0.3$).

At the same time, global metrics $S^*$ and $S^{KA}$ are able to
rank almost all relevant objects and achieve a much lower
average ranking, but they pay for this enhanced 'variety' with
worse performance at the top places of their recommendation
lists. When a maximal distance of five or more is considered
(which is necessary for making $S^{KA}$ a truly global
similarity metric with $f_R \approx 1$), $S^*$ significantly
outperforms $S^{KA}$ and, from the point of view of
recommendation, provides a good compromise between global and
local metrics. This is despite the fact that $S^{RA}$ and
$S^{KA}$ are computed on undirected data which gives them
access to more information: they assign similarity also to
nodes with overlapping progeny, not only to those with
overlapping ancestors as $S^*$ does. Further tests show that if
we prevent $S^{RA}(x,y)$ from accessing this information, its
precision and recall decrease to 0.104 and 0.124, respectively,
which is comparable to the results obtained with $S^*$. We may
conclude that $S^*$ is a reliable similarity metric which is
able to compete with other known metrics.

In conclusion, our results unveil the value of the passage 
probability in random walks on DAGs. On the example 
of scientific citations we showed that it allows us to quan- 
tify the influence of a given paper (node) on the others, to 
identify seminal and innovative papers (i.e., instrumental 
nodes of the network), and to introduce a similarity metric 
whose performance is comparable with that of other
state-of-the-art metrics. In this Letter, we aimed at
simplicity and hence we did not consider additional effects
that may have an impact on the interpretation of the analyzed
citation data. For example, we did not consider that every
paper relies on general knowledge which is, however, never
cited. To reflect that, one could for example add an artificial
node referred to by every other node in the network and repeat the
same analysis as we did. Further, similarly as for PageR- 
ank [5S] , our framework also lends itself to generalizations 
based on assigning past citations with lower weights to 
better reflect current relevance or, more generally, trends. 
We believe that our framework might prove useful well 
beyond citation networks as it opens possibilities for the 
investigation of asymmetric interactions in DAGs by exploiting
their intrinsic acyclic nature. The presented ideas and tools
can be readily applied to citation networks related to any kind
of intellectual production such as patents and legal cases.
Similar networks of dependency relations can also be found in
biology (phylogenetic networks and food webs, for example) as
well as in other systems that can be mapped into a DAG, where
the identification of fundamental nodes and the estimation of
dependency relations within the graph can be useful and
non-trivial tasks.